NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Leveraging Large Language Models to Detect npm Malicious Packages

Zahan, Nusrat; Burckhardt, Philipp; Lysenko, Mikola; Aboukhadijeh, Feross; Williams, Laurie (April 2025, IEEE)

Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a malicious code review workflow to detect malicious code. To evaluate the effectiveness SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious Javascript code. We also compare the effectiveness of static analysis as a pre-screener with SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1 scores, while GPT-3 offers a more cost-effective balance at 91% precision and 94% F1 scores. Prescreening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domain categories as the top detected malicious packages.
more » « less
Free, publicly-accessible full text available April 28, 2026
Research Directions in Software Supply Chain Security

https://doi.org/10.1145/3714464

Williams, Laurie; Benedetti, Giacomo; Hamer, Sivana; Paramitha, Ranindya; Rahman, Imranur; Tamanna, Mahzabin; Tystahl, Greg; Zahan, Nusrat; Morrison, Patrick; Acar, Yasemin; et al (June 2025, ACM Transactions on Software Engineering and Methodology)

Reusable software libraries, frameworks, and components, such as those provided by open source ecosystems and third-party suppliers, accelerate digital innovation. However, recent years have shown almost exponential growth in attackers leveraging these software artifacts to launch software supply chain attacks. Past well-known software supply chain attacks include the SolarWinds, log4j, and xz utils incidents. Supply chain attacks are considered to have three major attack vectors: through vulnerabilities and malware accidentally or intentionally injected into open source and third-partydependencies/components/containers; by infiltrating thebuild infrastructureduring the build and deployment processes; and through targeted techniques aimed at thehumansinvolved in software development, such as through social engineering. Plummeting trust in the software supply chain could decelerate digital innovation if the software industry reduces its use of open source and third-party artifacts to reduce risks. This article contains perspectives and knowledge obtained from intentional outreach with practitioners to understand their practical challenges and from extensive research efforts. We then provide an overview of current research efforts to secure the software supply chain. Finally, we propose a future research agenda to close software supply chain attack vectors and support the software industry.
more » « less
Free, publicly-accessible full text available June 30, 2026
MalwareBench: Malware Samples are Not Enough

Zahan, Nusrat; Burckhardt, Philipp; Lysenko, Mikola; Aboukhadijeh, Feross; Williams, Laurie (April 2024, Proceedings of the IEEE/ACM International Conference on Mining Software Repositories (MSR))

The prevalent use of third-party components in modern software development, coupled with rapid modernization and digitization, has significantly amplified the risk of software supply chain security attacks. Popular large registries like npm and PyPI are highly targeted malware distribution channels for attackers due to heavy growth and dependence on third-party components. Industry and academia are working towards building tools to detect malware in the software supply chain. However, a lack of benchmark datasets containing both malware and neutral packages hampers the evaluation of the performance of these malware detection tools. The goal of our study is to aid researchers and tool developers in evaluating and improving malware detection tools by contributing a benchmark dataset built by systematically collecting malicious and neutral packages from the npm and PyPI ecosystems. We present MalwareBench, a labeled dataset of 20,534 packages (of which 6,475 are malicious) of npm and PyPI ecosystems. We constructed the benchmark dataset by incorporating pre-existing malware datasets with the Socket internal benchmark data and including popular and newly released npm and PyPI packages. The ground truth labels of these packages were determined using the Socket AI Scanner and manual inspection.
more » « less
OpenSSF Scorecard: On the Path Toward Ecosystem-Wide Automated Security Metrics

https://doi.org/10.1109/MSEC.2023.3279773

Zahan, Nusrat; Kanakiya, Parth; Hambleton, Brian; Shohan, Shohanuzzaman; Williams, Laurie (November 2023, IEEE Security & Privacy)

The OpenSSF Scorecard project is an automated tool to monitor the security health of open-source software. This study evaluates the applicability of the Scorecard tool and compares the security practices and gaps in the npm and PyPI ecosystems.
more » « less
Full Text Available
Do Software Security Practices Yield Fewer Vulnerabilities?

https://doi.org/10.1109/ICSE-SEIP58684.2023.00032

Zahan, Nusrat; Shohan, Shohanuzzaman; Harris, Dan; Williams, Laurie (May 2023, IEEE)

Due to the ever-increasing number of security breaches, practitioners are motivated to produce more secure software. In the United States, the White House Office released a memorandum on Executive Order (EO) 14028 that mandates organizations provide self-attestation of the use of secure software development practices. The OpenSSF Scorecard project allows practitioners to measure the use of software security practices automatically. However, little research has been done to determine whether the use of security practices improves package security, particularly which security practices have the biggest impact on security outcomes. The goal of this study is to assist practitioners and researchers in making informed decisions on which security practices to adopt through the development of models between software security practice scores and security vulnerability counts. To that end, we developed five supervised machine learning models for npm and PyPI packages using the OpenSSF Scorecard security practices scores and aggregate security scores as predictors and the number of externally-reported vulnerabilities as a target variable. Our models found that four security practices (Maintained, Code Review, Branch Protection, and Security Policy) were the most important practices influencing vulnerability count. However, we had low R2 (ranging from 9% to 12%) when we tested the models to predict vulnerability counts. Additionally, we observed that the number of reported vulnerabilities increased rather than reduced as the aggregate security score of the packages increased. Both findings indicate that additional factors may influence the package vulnerability count. Other factors, such as the scarcity of vulnerability data, time to implicate security practices vs. time to detect vulnerabilities, and the need for more adequate scripts to detect security practices, may impede the data-driven studies to indicate that practice can aid in reducing externally-reported vulnerabilities. We suggest that vulnerability count and security score data be refined so that these measures can be used to provide actionable guidance on security practices.
more » « less
Software Bills of Materials Are Required. Are We There Yet?

https://doi.org/10.1109/MSEC.2023.3237100

Zahan, Nusrat; Lin, Elizabeth; Tamanna, Mahzabin; Enck, William; Williams, Laurie (March 2023, IEEE Security & Privacy)

Full Text Available
Do I really need all this work to find vulnerabilities, https://doi.org/10.1007/s10664-022-10179-6

https://doi.org/https://doi.org/10.1007/s10664-022-10179-6

Elder, Sarah; Zahan, Nusrat; Shu, Rui; Metro, Monica; Kozarev, Valeri; Menzies, Tim; Williams, Laurie (August 2022, Empirical software engineering)

Context: Applying vulnerability detection techniques is one of many tasks using the limited resources of a software project. Objective: The goal of this research is to assist managers and other decision- makers in making informed choices about the use of software vulnerability detection techniques through an empirical study of the efficiency and effectiveness of four techniques on a Java-based web application. Method: We apply four different categories of vulnerability detection techniques – systematic manual penetration testing (SMPT), exploratory manual penetration testing (EMPT), dynamic application security testing (DAST), and static application security testing (SAST) – to an open-source medical records system. Results: We found the most vulnerabilities using SAST. However, EMPT found more severe vulnerabilities. With each technique, we found unique vulnerabilities not found using the other techniques. The efficiency of manual techniques (EMPT, SMPT) was comparable to or better than the efficiency of automated techniques (DAST, SAST) in terms of Vulnerabilities per Hour (VpH). Conclusions: The vulnerability detection technique practitioners should select may vary based on the goals and available resources of the project. If the goal of an organization is to find “all” vulnerabilities in a project, they need to use as many techniques as their resources allow.
more » « less
Full Text Available
Structuring a Comprehensive Software Security Course Around the OWASP Application Security Verification Standard

Elder, Sarah; Zahan, Nusrat; Kozarev, Val; Shu, Rui; Menzies, Tim; Williams, Laurie (May 2021, Proceedings of the International Conference on Software Engineering)
null (Ed.)
Lack of security expertise among software practitioners is a problem with many implications. First, there is a deficit of security professionals to meet current needs. Additionally, even practitioners who do not plan to work in security may benefit from an increased understanding of security. The goal of this paper is to aid software engineering educators in designing a comprehensive software security course by sharing an experience running a software security course for the eleventh time. Through all the eleven years of running the software security course, the course objectives have been comprehensive -- ranging from security testing, to secure design and coding, to security requirements to security risk management. For the first time in this eleventh year, a theme of the course assignments was to map vulnerability discovery to the security controls of the Open Web Application Security Project (OWASP) Application Security Verification Standard (ASVS). Based upon student performance on a final exploratory penetration testing project, this mapping may have increased students' depth of understanding of a wider range of security topics. The students efficiently detected 191 unique and verified vulnerabilities of 28 different Common Weakness Enumeration (CWE) types during a three-hour period in the OpenMRS project, an electronic health record application in active use.
more » « less
Full Text Available

Search for: All records